Nature Medicine — Latest Matching Preprints

1

Human vs AI Clinical Assessment: Benchmarking a Multimodal Foundation Model Against Multi-Center Expert Judgment on the Mental Status Examination.

Mwangi, B.; Jabbar Abdl Sattar Hamoudi, H.; Sanches, M.; Dogan, N.; Chaudhary, P.; Wu, M.-J.; Zunta-Soares, G. B.; Soares, J. C.; Martin, A.; Soutullo, C. A.

2026-04-20 psychiatry and clinical psychology 10.64898/2026.04.17.26351105 medRxiv

Top 0.1%

52.1%

The Mental Status Examination (MSE) is the cornerstone of the psychiatric evaluation, yet validating artificial intelligence (AI) against the inherent variance of clinical judgment remains a critical bottleneck. Here we introduce a multi-center framework to benchmark the open-weight multimodal foundation model Qwen3-Omni against independent expert panels at two sites, UTHealth and Yale. Evaluating 396 classifications across 10 MSE domains and three longitudinal timepoints of increasing symptom severity, we found that experts achieved substantial agreement (Gwet's AC1 = 0.87), whereas the model achieved only moderate alignment (AC1 = 0.70-0.72). Although the model's overall pathology prediction rate approximated the experts', this aggregate balance masked a profound "clinical reasoning gap". Specifically, the model systematically over-predicted observable signs (e.g., speech, affect) while notably failing in inferential domains requiring the interpretation of latent mental content (e.g., delusions, perceptions). A 4-bit quantization analysis of the model confirmed this mechanistically: reducing model capacity disproportionately degraded inferential reasoning while preserving perceptual feature extraction. Furthermore, model-to-expert agreement degraded linearly as clinical complexity intensified across longitudinal visits (Accuracy: T0 = 84.8-87%; T1 = 80-82%; T2 = 71-73%), whereas expert consensus remained robust. Notably, model errors increased 2.3- to 3.4-fold where human experts disagreed. These findings establish inter-expert variance as an essential measurable baseline for psychiatric AI, demonstrating that true clinical translation requires models to move beyond multimodal perceptual extraction to achieve higher-order diagnostic reasoning.
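For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Gwet's AC1 for two raters scoring the same items, using the standard two-rater formula. It is not the preprint's analysis code: how the authors aggregated panels, domains and timepoints is not described in the abstract, and the function name `gwet_ac1` and the toy ratings are hypothetical.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters and nominal categories.

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and
    p_e = sum_k pi_k * (1 - pi_k) / (q - 1), with pi_k the mean of the two
    raters' marginal proportions for category k and q the number of categories.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    if categories is None:
        # Assumption: fall back to the categories actually observed.
        categories = set(rater_a) | set(rater_b)
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")

    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    chance = 0.0
    for c in categories:
        pi = (counts_a[c] + counts_b[c]) / (2 * n)
        chance += pi * (1 - pi)
    chance /= q - 1
    return (observed - chance) / (1 - chance)


# Hypothetical toy data: 1 = pathology present, 0 = absent.
expert = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(round(gwet_ac1(expert, model, categories=(0, 1)), 3))
```

Unlike Cohen's kappa, the chance term here stays small when one category dominates, which is one reason AC1 is often preferred for skewed clinical ratings.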

2

Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta Analysis

Zhu, L.; Wang, W.; Liang, Z.; Tan, W.; Chen, B.; Lin, X.; Wu, Z.; Yu, H.; Li, X.; Jiao, J.; He, S.; Dai, G.; Niu, J.; Zhong, Y.; Hua, W.; Chan, N. Y.; Lu, L.; Wing, Y. K.; Ma, X.; Fan, L.

2026-04-22 psychiatry and clinical psychology 10.64898/2026.04.21.26351365 medRxiv

Top 0.1%

27.4%

The rapid rise of large language models (LLMs) and foundation models has accelerated efforts to build artificial intelligence (AI) agents for mental health assessment, triage, psychotherapy support and clinical decision assistance. Yet a gap persists between healthcare and AI-focused work: while both communities use the language of "agents," clinical research largely describes monolithic chatbots, whereas AI studies emphasize agentic properties such as autonomous planning, multiagent coordination, tool and database use and integration with multimodal mental health data streams. In this Review, we conduct a systematic analysis of mental health AI agent systems from 2023 to 2025 using a six-dimensional audit framework: (i) system type (base model lineage, interface modality and workflow composition, from rule-based tools to role-aware multi-agent foundation-model systems), (ii) data scope (modalities and provenance, from elicited self-report and chatbot dialogues to electronic health records, biosensing and synthetic corpora), (iii) mental health focus (mapped to ICD-11 diagnostic groupings), (iv) demographics (age strata, geography and sex representation), (v) downstream tasks (screening/triage, clinical decision support, therapeutic interventions, documentation, ethical-legal support and education/simulation) and (vi) evaluation types (automated metrics, language quality benchmarks, safety stress tests, expert review and clinician or patient involvement). 
Across this corpus, we find that most systems (1) concentrate on depression, anxiety and suicidality, with sparse coverage of severe mental illness, neurocognitive disorders, substance use and complex comorbidity; (2) rely heavily on text-based self-report rather than clinically verified longitudinal data or genuinely multimodal inputs; (3) are implemented as single-agent chatbots powered by general-purpose LLMs rather than role-structured, workflow-integrated pipelines; and (4) are evaluated primarily via offline metrics or vignette-based scenarios, with few prospective, clinician- or patient-in-the-loop studies. At the same time, an emerging class of agentic systems assigns foundation models explicit roles as planners, retrieval agents, safety auditors or supervisors coordinating other models and tools. These multiagent, tool-augmented workflows promise personalization, safety monitoring and greater transparency, but they also introduce new risks around reliability, bias amplification, privacy, regulatory accountability and the blurring of clinical versus non-clinical roles. We conclude by outlining priorities for the next generation of mental health AI agents: clinically grounded, role-aware multi-agent architectures; transparent and privacy-preserving use of clinical and elicited data; demographic and cultural broadening beyond predominantly Western adult samples; and evaluation pipelines that progress from offline benchmarks to longitudinal, real-world studies with routine safety auditing and clear governance of responsibilities between agents and human clinicians.

3

NeuroFM: Toward Precision Neuroimaging with Foundation Models for Individualized Brain Health Estimation

Dibble, A.; Dalby, C.; Sevegnani, M.; Fracasso, A.; Lyall, D. M.; Harvey, M.; Svanera, M.

2026-03-31 neurology 10.64898/2026.03.27.26349489 medRxiv

Top 0.1%

22.5%

Precision neuroimaging aims to deliver individualized assessments of brain health, yet a single structural MRI does not yield a multidimensional, quantitative summary of an individual's current health or future risk. Existing approaches optimize task-specific objectives, yielding representations entangled with cohort- or disease-specific signals rather than capturing biologically grounded patterns of anatomical variation. Here, we introduce NeuroFM, a foundation model trained exclusively on 100,000 healthy synthetic volumes to predict morphometric and demographic targets. Without exposure to diagnostic labels, NeuroFM organizes brain MRIs into population-level patterns that encode meaningful brain health differences. These representations transfer across five neuroscience domains without adaptation and support simple linear readouts for clinical, cognitive, developmental, socio-behavioural, and image quality control. Evaluated on 136,361 real volumes spanning multiple cohorts, NeuroFM generalizes across domains and enables individual-level brain health profiling, estimating future dementia risk years before diagnosis. Together, these findings establish a disease-naive foundation model paradigm for precision neuroimaging.

4

Stabilized gp120-specific CD4 for next-generation HIV-1 inhibitors

Bahn-Suh, A. J.; Caldera, L. F.; Gnanapragasam, P. N. P.; Keeffe, J. R.; Seaman, M. S.; Bjorkman, P. J.; Mayo, S. L.

2026-03-27 bioengineering 10.64898/2026.03.24.713825 medRxiv

Top 0.1%

19.3%

HIV-1 Env's gp120 subunit uses the T-cell coreceptor CD4 to enter host cells in a manner that prevents the evolution of host resistance by sharing the binding epitope with the footprint of CD4's natural ligands, class II MHC proteins [1,2]. Consequently, CD4-containing biologics, such as CD4-Ig [3,4] and derivatives [5-9], benefit from this conserved relationship and are promising broad-acting anti-HIV-1 agents that are resistant to viral mutational escape [10]. However, these biologics suffer from short serum half-lives in humans [11,12] and animals [3,13], likely due to CD4's poor thermostability [14] and/or off-target class II MHC binding [15]. This latter property also warrants caution for CD4-containing biologics that could indiscriminately recruit Fc-dependent effector functions against uninfected cells and/or compete with host CD4 for class II MHC during T cell interactions with antigen-presenting cells. Here, we describe gp120-specific CD4 (gCD4), which exhibits enhanced thermostability and retains Env, but not class II MHC, binding. CD4-Ig variants incorporating gCD4 did not bind class II MHC on human B cells, displayed greater longevity in human tonsil organoid cultures, showed half-lives equivalent to therapeutic IgG antibodies in mice, and neutralized HIV-1 more broadly and potently compared to the original CD4-Ig molecules. Encouragingly, one variant neutralized 100% of a panel of clinically-relevant HIV-1 strains at titers correlating to infection prevention in humans, outperforming known broadly neutralizing antibodies [16,17]. Thus, gCD4 holds promise for the development of new CD4-containing biologics with best-in-class specificity, pharmacokinetic properties, and neutralization breadth and potency.

5

Evaluating Large Language Models for Assessment of Psychosis Risk

Zhu, T.; Tashevski, A.; Taquet, M.; Azis, M.; Jani, T.; Broome, M. R.; Kabir, T.; Minichino, A.; Murray, G. K.; Nour, M. M.; Singh, I.; Fusar-Poli, P.; Nevado-Holgado, A.; McGuire, P.; Oliver, D.

2026-04-04 psychiatry and clinical psychology 10.64898/2026.04.02.26349960 medRxiv

Top 0.1%

19.2%

Psychosis prevention relies on early detection of individuals at clinical high risk for psychosis (CHR-P), yet such detection remains limited, constraining preventive care. The effectiveness of the CHR-P state is constrained, in part because clinical assessments require specialist interpretation of narrative interviews, limiting scalability. Here, we evaluate whether large language models (LLMs; deep learning models trained on large text corpora to process and generate language) can extract clinically meaningful information from such interviews to support psychosis risk assessment. We assessed 11 open-weight LLMs on 678 PSYCHS interview transcripts from 373 participants (77.7% CHR-P). Models inferred CHR-P status and estimated severity and frequency across 15 symptom domains, benchmarked against researcher-rated scores. Larger models achieved the strongest classification performance (Llama-3.3-70B: accuracy = 0.80, sensitivity = 0.93, specificity = 0.58). LLM-generated symptom scores showed good correlations with researcher-rated scores (ICC for severity = 0.74, ICC for frequency = 0.75). Performance disparities were minimal across most demographic groups but varied across sites. Generated summaries were largely faithful to source transcripts, with low rates of clinically relevant confabulation (3%). Errors primarily reflected over-pathologisation of non-clinical experiences. While accuracy scaled with model size, smaller models achieved competitive performance with substantially lower computational cost. These findings demonstrate that open-weight LLMs can assess psychosis risk from clinical interview transcripts, supporting scalable, human-in-the-loop approaches to early detection.

6

Peer support boosted Hepatitis C treatment access among marginalised populations in England: A Bayesian causal factor analysis.

Schmidt, C.; Samartsidis, P.; Seaman, S.; Emmanouil, B.; Foster, G.; Reid, L.; Smith, S.; De Angelis, D.

2026-04-22 health policy 10.64898/2026.04.20.26351261 medRxiv

Top 0.1%

18.6%

To minimise health disparities, equitable access to medical treatment is paramount. In a pioneering intervention, National Health Service England's Hepatitis C virus (HCV) programme has implemented country-wide peer support to boost treatment access. Peer support workers (peers) are individuals with relevant lived experience, who promote testing and treatment in marginalised populations underserved by traditional health services. We evaluated the English peers intervention, exploiting its staggered rollout and rich surveillance data between June 2016 and May 2021. Peers increased HCV cases identified by 13·9% (95% credible interval (95% CrI) [5·3, 21·7]), sustained viral responses by 8·0% (95% CrI [-4·4, 18·6]), and drug services referrals by 8·8% (95% CrI [-12·5, 22·6]). The intervention's effectiveness was magnified during the first COVID-19 lockdown and individuals supported by peers typically belonged to populations with poor treatment access. Our findings indicate that peers can boost equity in treatment access on a national scale.

7

The Evolutionary Dynamics and Regional Spread of Mpox in Africa: Insights from Multi-country Genomic Surveillance

Tanui, C. K.; Kinganda-Lusamaki, E.; O'Toole, A.; Chitenje, M.; Campbell, A. K. O.; DIAGNE, M. M.; Kanyerezi, S.; Faye, M.; Ifabumuyi, S. O.; Nzoyikorera, N.; Lango, H. O.; Koukouikila-Koussounda, F.; Meite, S.; Sikazwe, E.; Djuicy, D. D.; Adu, B.; MAMAN, I.; Mapunda, L. A.; Nyan, D. C.; Stephane, S.; Aricha, S. A.; Cherif Gnimadi, T. A.; Maror, J. A.; Pereira, A. M.; Atrah, Y. S.; Akanbi, O. A.; Lokilo, E. L.; Makangara-Cigolo, J.-C.; Paku, P. T.; Luakanda, G. N.; Amuri-Aziza, A.; Wawina-Bokalanga, T.; Mugerwa, I.; Nsawotebba, A.; Ayitewala, A.; Williams, A. J.; Folorunso, V.; Mani, S.; Hardi

2026-04-11 infectious diseases 10.64898/2026.04.07.26347884 medRxiv

Top 0.1%

18.4%

The recent MPXV epidemic across Africa revealed extensive viral diversity and complex transmission dynamics, prompting a continent-wide genomic investigation. We analysed 3,450 high-quality MPXV whole genomes from 24 African Union Member States, revealing the complex and concurrent circulation of Subclades Ia, Ib, IIa, and IIb. Subclade Ia showed high levels of virus diversity in reservoir hosts in Central Africa, detected through zoonotic transmission, with some sustained human outbreaks detected most recently. In contrast, Subclade Ib exhibited signatures of sustained human-to-human transmission across Eastern and Southern Africa. Subclade IIa remains largely zoonotic in West Africa. Like Ia, IIb shows continued zoonotic transmission, as well as sustained human outbreaks linked to the circulation of lineages G1 and G2. Phylogeographic analyses revealed frequent cross-border transmission and interconnectedness, which aligned with both human mobility corridors and international boundaries. For instance, the Democratic Republic of the Congo and Sierra Leone each appear to emerge as a source of regional exportation, while the Cameroon-Nigeria, CAR-Cameroon and CAR-DRC interfaces reflected ongoing cross-border zoonotic spillovers. These findings underscore the need for harmonised genomic surveillance, APOBEC3-aware triage, and integrated One Health strategies to prevent local outbreaks from escalating into regional epidemics and to inform vaccine deployment and public health preparedness.

8

A Cerebral Frailty Risk Score Integrating Frailty Index and Neuroimaging for Dementia Prediction in the UK Biobank

Kan, C. N.; Chew, J.; Lim, W. S.; Tan, C. H.

2026-04-04 geriatric medicine 10.64898/2026.04.01.26350015 medRxiv

Top 0.1%

18.3%

Frailty is a multisystem clinical syndrome closely linked to cognitive aging, yet its cerebral underpinnings and co-contribution to adverse outcomes remain poorly understood. In 63,509 dementia-free UK Biobank participants (aged 65.0 ± 7.7), higher frailty index (FI) was associated with multiple neuroimaging markers, including reduced hippocampal volume, decreased cortical thickness, greater white matter hyperintensity burden, and impaired brain diffusion metrics. FI and neuroimaging markers additively increased the risks of incident dementia and mortality. An extreme gradient boosting with accelerated failure time framework highlighted FI and key regional neuroimaging features in dementia risk prediction (nested C-index=0.825, iAUC=0.759). Integrating the top 10 predictors into a novel point-based cerebral frailty risk score (CFRS) showed strong performance in predicting dementia onset (optimism-corrected C-index=0.838, iAUC=0.778), and was robust to the competing risk of mortality. These findings highlight the potential utility of a CFRS framework that integrates cumulative systemic and cerebral vulnerabilities for dementia risk stratification.

9

Greater lean-body-mass decline with tirzepatide than semaglutide in routine care, revealed by body-composition digital phenotyping

Murugadoss, K.; Venkatakrishnan, A.; Soundararajan, V.

2026-04-13 endocrinology 10.64898/2026.04.11.26350687 medRxiv

Top 0.1%

17.9%

GLP-1 receptor agonists induce substantial weight loss, but the extent to which lean tissue and physical function are preserved in routine care remains poorly understood. Using an EHR-linked body-composition digital phenotyping pipeline with LLM-based extraction, we performed a large-scale longitudinal analysis of 670,422 first-episode GLP-1RA users, including 456,742 treated with semaglutide and 213,680 treated with tirzepatide. Among these, 7,965 individuals with paired pre- and post-initiation body-composition measurements were analyzed over 12 months. Tirzepatide was associated with greater relative lean body mass (LBM) loss than semaglutide at each measured time point, with excess LBM losses of 1.1%, 1.5%, 1.3% and 2% at 3, 6, 9 and 12 months, respectively. A Depletive GLP-1 metabotype, defined as >20% total body weight (TBW) loss with >5% LBM loss, was significantly more frequent with tirzepatide than semaglutide during the first year of therapy (10.3% versus 6.7%, p<0.001). By contrast, a Prime GLP-1 metabotype, defined as >10% TBW loss with <5% LBM loss, was numerically more frequent with semaglutide than tirzepatide, but not significantly so (12.3% versus 11.8%, p=0.66). Higher drug dose and longer exposure were associated with progressively greater LBM decline in both treatment groups (both p<0.001). Among 3,746 examined EHR phenotypes, baseline musculoskeletal pain emerged as the most significant correlate of greater LBM loss (BH-adjusted q<0.001): cervicalgia (semaglutide, -4.1 percentage points; tirzepatide, -14.3 percentage points) and knee pain (semaglutide, -4.8 percentage points; tirzepatide, -13.4 percentage points), consistent with mobility-limited patients being more vulnerable to lean-tissue depletion during incretin therapy. 
Analysis of EHR notes for on-treatment functional features showed reduced exercise tolerance was the strongest correlate of greater LBM loss, increasing by 7.2 and 11.1 percentage points in semaglutide- and tirzepatide-treated patients, respectively. An independent analysis of all available single-cell RNA-seq data from human musculature showed broader distribution of GIPR+ cells than of GLP1R+ cells across immune, stromal, vascular, and contractile compartments, providing plausible biological context for the greater LBM loss observed in routine care with tirzepatide (dual GLP1R-GIPR agonist) relative to semaglutide (GLP1R-specific agonist). In this observational study, greater weight-loss efficacy did not necessarily translate into more favorable body-composition outcomes, underscoring the need for clinical decision-making and trial designs that maximize each patient's likelihood of achieving a Prime GLP-1 metabotype.

10

Vision-language framework for multi-sequence brain magnetic resonance imaging

Lteif, D.; Jia, S.; Bit, S.; Kaliaev, A.; Mian, A. Z.; Small, J. E.; Mangaleswaran, B.; Plummer, B. A.; Bargal, S. A.; Au, R.; Kolachalama, V. B.

2026-04-04 radiology and imaging 10.64898/2026.03.30.26349106 medRxiv

Top 0.1%

14.8%

Structural magnetic resonance imaging (MRI) is a cornerstone for diagnosing neurological disorders, yet automated interpretation of multi-sequence brain MRI remains limited by challenges in cross-sequence reasoning and protocol variability. Here we present ReMIND, a vision-language modeling framework tailored for comprehensive multi-sequence and multi-volumetric brain MRI analysis. Trained on over 73,000 deidentified patient visits encompassing more than 850,000 MRI sequences paired with radiology reports from diverse clinical and research cohorts, ReMIND combined large-scale instruction tuning on more than one million clinically grounded question-answer (QA) pairs with targeted supervised fine-tuning for radiology report generation. At inference, ReMIND employed modality-aware reranking and correction, a report-level decoding strategy that suppressed unsupported modality claims while preserving linguistic fluency and clinical coherence. Cross-cohort generalization was maintained on independent external datasets from different institutions. These findings represent an advance toward consistent and equitable brain MRI interpretation, meriting prospective evaluation to support diagnosis and management of neurological conditions.

11

Generational gains in memory capacity and stability may account for declining dementia incidence rates in Europe and the United States

Fjell, A. M. M.; Grodem, E. O. S. O. S.; Lunansky, G.; Vidal-Pineiro, D.; Rogeberg, O. J.; Walhovd, K. B.

2026-04-15 neurology 10.64898/2026.04.14.26350835 medRxiv

Top 0.1%

14.4%

Dementia incidence has been declining in Western societies for decades, but whether this reflects higher cognitive capacity entering old age, slower cognitive decline, or both remains unresolved. Analysing ~783,000 episodic memory assessments from ~219,000 individuals across five longitudinal cohorts, we find that later-born cohorts benefit from a double dividend: higher memory levels entering old age and slower rates of decline. The projected 20-year cohort advantage at age 80 is of sufficient magnitude to plausibly account for the observed 13% per-decade decline in dementia incidence reported in meta-analyses. Generational gains are disproportionately concentrated among the fastest-declining individuals, and are reflected in lower hippocampal atrophy rates in an independent sample. A formal bounding analysis shows that the double dividend is robust across a range of plausible period assumptions, consistent with environmental conditions operating across the lifespan having reshaped the architecture of human cognitive aging.

12

Identify Patients at Risk of HIV Using a Clinical Large Language Model from Electronic Health Records

Liu, Y.; Chen, Z.; Suman, P.; Cho, H.; Prosperi, M.; Wu, Y.

2026-04-23 hiv aids 10.64898/2026.04.21.26351427 medRxiv

Top 0.1%

12.6%

This study developed a large language model (LLM)-based solution to identify people at risk of HIV using electronic health records (EHRs). We transformed structured EHR data, including demographics, diagnoses, and medications, into narrative descriptions ordered by visit date and applied GatorTron, a widely used clinical LLM trained on 82 billion words of de-identified clinical text. We compared GatorTron with traditional machine learning models, including LASSO and XGBoost. We identified a cohort of 54,265 individuals, of whom only 3,342 (6%) had new HIV diagnoses. Our LLM solution, based on GatorTron, achieved excellent performance, reaching an F1 score of 53.5% and an AUC of 0.88, comparable to traditional machine learning approaches. Subgroup analysis showed that, across age, sex, and race/ethnicity groups, both LLM and traditional models achieved AUCs above 0.82. Interpretability analyses showed broadly consistent patterns across LLMs and traditional machine learning models.

13

Digital journaling enables privacy-preserving behavioral phenotyping and real-time risk monitoring at scale

Milham, M.; Low, D.; Erkent, A.; Trabulsi, J.; Kass, M. C.; Vos de Wael, R.; Yenepalli, S.; Wang, Y.; Leyden, M.; Jordan, C.; Salum, G.; Alexander, L.; Schubiner, G.; Hendrix, L.; Koyama, M.; Mears, L.; McAdams, R.; White, C.; Merikangas, K.; Satterthwaite, T. D.; Franco, A.; Klein, A.; Koplewicz, H.; Leventhal, B.; Freund, M.; Kiar, G.

2026-04-08 psychiatry and clinical psychology 10.64898/2026.04.04.26349881 medRxiv

Top 0.1%

12.4%

Digital mental health applications enable high-frequency behavioral monitoring and scalable interventions. Journaling provides a therapeutically grounded and intrinsically engaging activity for many users. AI-based text analysis enables privacy-preserving phenotyping of clinically relevant patterns in naturalistic writing, including emotional distress and behavioral risk (e.g., indicators of intent, planning, or preparatory actions for harm to self or others). We evaluated a mobile journaling platform in an 8-week randomized controlled trial (N = 507) of young adults with mild-to-moderate anxiety and depression symptoms. Journaling produced modest reductions in anxiety relative to controls at the 8-week endpoint and 1-month follow-up (d = 0.16-0.19). Effects were small and did not remain significant after correction for multiple comparisons; complementary Bayesian models nonetheless provided moderate-to-strong directional evidence (90-97%) supporting a modest anxiety reduction. In parallel, behavioral phenotyping analyses showed that high-risk journal entries were more common among younger users (OR = 0.77 per year of age, p = 0.007). Text-based risk signals and self-reported energy exhibited significant circadian variation (e.g., risk probability was highest during late-night and overnight hours). Within-person analyses demonstrated strong short-term persistence in mood and risk states, with calm/relaxed showing the highest persistence and anxious/agitated exhibiting the lowest persistence. High-risk journal entries clustered temporally and were preceded by sustained low valence and energy. Although affective volatility was associated with acute declines within the same affective dimension (pleasantness or energy), it was not associated with escalation to high-risk states. Key behavioral dynamics observed in the trial were replicated in an independent general population dataset (N = 16,630). 
Collectively, these findings demonstrate that privacy-preserving digital journaling can support scalable longitudinal behavioral phenotyping and real-time risk monitoring while providing modest clinical benefit for anxiety symptoms.

14

Plasma proteomics link menopause timing to brain aging and dementia risk

Wood Alexander, M.; Wood, B.; Oh, H. S.-H.; Bot, V. A.; Borger, J.; Galbiati, F.; Walker, K. A.; Resnick, S. M.; Ochs-Balcom, H. M.; Wyss-Coray, T.; Kooperberg, C.; Reiner, A. P.; Jacobs, E. G.; Rabin, J. S.; Casaletto, K. B.; Saloner, R.

2026-04-24 neurology 10.64898/2026.04.23.26351500 medRxiv

Top 0.1%

12.3%

Earlier menopause is a risk factor for several age-related diseases, including dementia. The biological pathways linking menopause timing to later-life brain aging are not understood. In large-scale plasma proteomics of postmenopausal women from the UK Biobank (UKB; N=15,012), earlier menopause was associated with upregulation of pro-inflammatory and extracellular matrix degradation pathways, plus accelerated aging across proteomic clocks of organ and cellular aging, including brain and oligodendrocyte aging. Elevated GDF15, a canonical aging marker, was the top protein correlate of earlier menopause. We observed robust replication of menopause timing proteomic shifts in the Women's Health Initiative Long Life Study (N=1,210). In UKB, proteins associated with earlier menopause, including GDF15, exhibited concordant associations with incident dementia risk and brain atrophy, cerebral small vessel disease burden, and white matter microstructural integrity. Collectively, our findings identify proteomic signatures linking ovarian aging to brain aging, providing a framework to inform interventions to reduce dementia risk.

15

Ad-verse Effects: Pharmaceutical Advertising Shifts Drug Recommendations by Consumer-Facing AI

Omar, M.; Agbareia, R.; McGreevy, J.; Zebrowski, A.; Ramaswamy, A.; Gorin, M.; Anato, E. M.; Glicksberg, B. S.; Sakhuja, A.; Charney, A.; Klang, E.; Nadkarni, G.

2026-04-16 health policy 10.64898/2026.04.14.26350868 medRxiv

Top 0.1%

12.3%

Large language models are increasingly used for clinical guidance while their parent companies introduce advertising. We tested whether pharmaceutical ads embedded in the prompts of 12 models from OpenAI, Anthropic, and Google shift drug recommendations across 258,660 API calls and four experiments probing distinct epistemic conditions. When two drugs were both guideline-appropriate, advertising shifted selection of the advertised drug by +12.7 percentage points (P < 0.001), with some model-scenario pairs shifting from 0% to 100%. Google models were the most susceptible (+29.8 pp), followed by OpenAI (+10.9 pp), while Anthropic models showed minimal change (+2.0 pp). When the advertised product lacked evidence or was clinically suboptimal, models resisted. This reveals a structured vulnerability: advertising does not override medical knowledge but fills the space where clinical evidence is underdetermined. An open-response sub-analysis (2,340 calls across three representative models) confirmed that advertising restructures free-text clinical reasoning: models echoed ad claims at 2.7 times the baseline rate while maintaining high stated confidence and rarely disclosing the ad. Susceptibility was provider-dependent (Google: +29.8 pp; OpenAI: +10.9 pp; Anthropic: +2.0 pp). Because this bias operates within clinically correct answers, it is invisible to accuracy-based evaluation, identifying a class of AI safety vulnerability that standard testing cannot detect.

16

Multi-BOUNTI: Multi-lobe Brain vOlUmetry and segmeNtation for feTal and neonatal MRI

Uus, A.; Fukami-Gartner, A.; Kyriakopoulou, V.; Cromb, D.; Morgan, T.; Arulkumaran, S.; Egloff Collado, A.; Luis, A.; Bos, R.; Makropoulos, A.; Schuh, A.; Robinson, E.; Sousa, H.; Deprez, M.; Cordero-Grande, L.; Bradshaw, C.; Colford, K.; Hutter, J.; Price, A.; O'Muircheartaigh, J.; Hammers, A.; Rueckert, D.; Counsell, S.; McAlonan, G.; Arichi, T.; Edwards, A. D.; Hajnal, J. V.; Rutherford, M. A.; Story, L.

2026-04-22 pediatrics 10.64898/2026.04.21.26351376 medRxiv

Top 0.1%

12.2%

Regional volumetric assessment of perinatal brain development is currently limited by the lack of consistent high-quality multi-regional segmentation methods applicable to both fetal and neonatal MRI. We present Multi-BOUNTI, a deep learning pipeline for automated multi-lobe segmentation of fetal and neonatal T2w brain MRI. The method is based on a dedicated 43-label parcellation protocol and a 3D Attention U-Net trained on brain MRI datasets of subjects spanning 21-44 weeks gestational/postmenstrual age. The pipeline integrates preprocessing, segmentation and volumetric analysis, and was evaluated on independent datasets, demonstrating fast (< 10 min/case) and accurate performance with high agreement to manually refined labels. We demonstrate the application of the framework with 267 fetal and 593 neonatal MRI datasets from the developing Human Connectome Project without reported clinically significant brain anomalies to derive normative volumetric growth models across 21-44 weeks GA/PMA. These models were used to characterise developmental trajectories, assess differences between fetal and preterm neonatal cohorts, and analyse longitudinal changes. The resulting normative models were integrated into an automated reporting framework enabling subject-specific volumetric assessment via centiles and z-scores. Multi-BOUNTI provides a unified and scalable approach for perinatal brain segmentation and volumetry, supporting large-scale studies and facilitating future clinical translation. The full pipeline is publicly available at https://github.com/SVRTK/perinatal-brain-mri-analysis.
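As an illustration of what subject-specific assessment via centiles and z-scores can look like, here is a minimal sketch. It assumes, purely for illustration, that a normative model supplies an expected mean and standard deviation for a regional volume at the subject's age and that the residuals are Gaussian; the abstract does not state the form of the Multi-BOUNTI growth models. The function name and all numbers are hypothetical and are not taken from the published pipeline.

```python
from statistics import NormalDist


def volume_to_zscore_and_centile(volume, expected_mean, expected_sd):
    """Convert a measured volume into a z-score and a centile.

    Assumption: the normative distribution at this age is Gaussian with the
    given mean and standard deviation (same units as the measured volume).
    """
    if expected_sd <= 0:
        raise ValueError("expected_sd must be positive")
    z = (volume - expected_mean) / expected_sd
    centile = 100.0 * NormalDist().cdf(z)
    return z, centile


# Hypothetical example: a regional volume of 9.0 cm^3 where the normative
# model expects 10.0 cm^3 with a standard deviation of 1.0 cm^3.
z, centile = volume_to_zscore_and_centile(9.0, 10.0, 1.0)
print(f"z = {z:.2f}, centile = {centile:.1f}")
```

A non-Gaussian normative model would replace the `NormalDist` step with that model's own cumulative distribution function.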

17

Predicting long-term adverse outcomes after neonatal intensive care

Ogretir, M.; Kaipainen, V.; Leskinen, M.; Lahdesmaki, H.; Koskinen, M.

2026-03-31 pediatrics 10.64898/2026.03.26.26348580 medRxiv

Top 0.1%

10.1%

Neonates requiring intensive care are at increased risk for long-term neuropsychiatric disorders. However, clinical adoption of risk prediction models remains limited when their performance lacks adequate interpretability for informed clinical decision-making. Here, we investigated whether longitudinal neonatal electronic health record (EHR) data from the first 90 days of life can support clinically meaningful interpretation of long-term risk signals for major neuropsychiatric diagnoses by age seven. In a retrospective register-based cohort of 17,655 at-risk children from an academic medical center, of whom 8.0% (1,420) received a major neuropsychiatric diagnosis during follow-up, we applied a time-aware transformer model (Self-supervised Transformer for Time-Series; STraTS) and thoroughly evaluated its predictions using three complementary interpretability approaches: perturbation-based variable importance, value-dependent effect analysis, and leave-one-out (LOO) feature attribution. STraTS achieved the highest area under the precision-recall curve (AUPRC 0.171 ± 0.022), compared with Random Forest (0.166 ± 0.008), logistic regression (0.151 ± 0.007), and XGBoost (0.128 ± 0.010). Across interpretability methods, five predictors were consistently identified: birth weight, gender, Apgar score at 1 minute, umbilical serum thyroid stimulating hormone (uS-TSH), and treatment time in hospital. Indicators of early clinical severity, including chromosomal abnormalities and neonatal cerebral-status disturbances, showed the largest risk-increasing effects. Furthermore, the model's learned vector representations of subject-specific EHR sequences formed clinically coherent latent embeddings that reflect population heterogeneity along established perinatal risk dimensions.
These findings demonstrate that combining multiple complementary interpretability methods yields stable, clinically plausible risk signals while revealing limitations that would remain undetected by any single approach, highlighting the importance of careful interpretability analysis of deep learning-based risk predictions.
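To put the reported AUPRC values in context, recall that a classifier ranking cases at random has an expected AUPRC roughly equal to the positive-class prevalence. The sketch below computes that chance level from the cohort counts in the abstract and expresses each reported AUPRC as a multiple of it. It assumes the evaluation folds share the cohort-wide prevalence, which the abstract does not confirm; the helper name is illustrative.

```python
# Cohort counts taken from the abstract.
n_children = 17655
n_cases = 1420

# Chance-level AUPRC is approximately the positive-class prevalence.
prevalence = n_cases / n_children

# Mean AUPRC values as reported in the abstract.
reported_auprc = {
    "STraTS": 0.171,
    "Random Forest": 0.166,
    "Logistic regression": 0.151,
    "XGBoost": 0.128,
}


def lift_over_chance(auprc, chance_level):
    """Ratio of a model's AUPRC to the chance-level AUPRC."""
    return auprc / chance_level


lifts = {name: lift_over_chance(value, prevalence)
         for name, value in reported_auprc.items()}

print(f"chance-level AUPRC ~ {prevalence:.3f}")
for name, lift in lifts.items():
    print(f"{name}: {lift:.2f}x chance")
```

Read this way, the best model performs at roughly twice the chance level, and the gaps between models are small relative to the reported standard deviations.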

18

Heart Failure Prediction & Risk Stratification using Machine Learning

Ali, S.; Leavitt, M. A.; Asghar, W.

2026-04-05 public and global health 10.64898/2026.04.03.26350139 medRxiv

Top 0.1%

10.0%

Heart failure (HF) is one of the most prevalent causes of morbidity, mortality, and healthcare expenditures, with approximately 6.7 million adults in the U.S. suffering from this condition and contributing to hundreds of thousands of deaths annually. Early identification of high-risk individuals remains a challenge, as HF-specific symptoms are often ignored or misinterpreted as normal aging, stress, or minor illnesses, leading to delayed diagnosis. We trained, tested, and evaluated several models, including logistic regression, SVM, KNN, random forest, XGBoost, MLP, and a custom stacked ensemble, using stratified 5-fold CV and 70/30 hold-out splits for HF prediction on routinely available electronic medical record (EMR) data from the All of Us Research Program. The cohort comprised 37,070 adults (13,577 HF; 23,493 non-HF). The predictors included readily available demographics, vital signs, conditions, and laboratory results. Preprocessing steps included IQR-winsorization, median imputation, one-hot encoding, and QuantileTransformer. The stacked model obtained ROC-AUC 0.927, PR-AUC 0.895, and accuracy 0.856 in the test set. To support real-world deployment, we calibrated predicted probabilities and adjusted them to a realistic population prevalence, yielding interpretable probability estimates and clear stratification of individuals into clinically actionable risk tiers. SHAP analysis identified atrial fibrillation, age, hypertensive disorder, sodium, and deprivation index as the five most influential features for the model's predictions. A secondary multiclass experiment (No-HF, HF with reduced ejection fraction, and HF with preserved ejection fraction) achieved lower discrimination (macro-AUC ~0.87) and lower per-class precision and recall, presumably due to label noise, class imbalance, and overlapping phenotypes.
We have demonstrated that a carefully calibrated stacked ensemble built on a combination of readily available EMR variables can achieve strong discrimination for HF, making it an effective tool for an AI clinical decision support system (AI-CDSS) in population screening and proactive care pathways.
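The adjustment of calibrated probabilities to a realistic population prevalence can be sketched with the standard prior-shift correction, which rescales the odds by the ratio of target to training prior odds. The abstract does not say which method the authors used or what target prevalence they chose, so the formula is a common choice rather than their confirmed procedure, and `target_prev` below is a hypothetical value.

```python
def adjust_to_prevalence(p, train_prev, target_prev):
    """Re-weight a calibrated probability from the training prevalence
    to a target population prevalence (standard prior-shift correction).

    Assumes the class-conditional feature distributions are the same in
    both populations; only the base rate differs.
    """
    pos = p * target_prev / train_prev
    neg = (1.0 - p) * (1.0 - target_prev) / (1.0 - train_prev)
    return pos / (pos + neg)


# Training prevalence from the cohort counts in the abstract.
train_prev = 13577 / 37070

# Hypothetical target prevalence for a general adult population.
target_prev = 0.026

for p in (0.2, 0.5, 0.9):
    adjusted = adjust_to_prevalence(p, train_prev, target_prev)
    print(f"cohort-scale p = {p:.2f} -> population-scale p = {adjusted:.3f}")
```

Because HF cases make up about 37% of the study cohort, uncorrected probabilities would overstate risk in a population where HF is far rarer. The correction preserves the ranking of individuals and changes only the scale on which risk tiers are defined.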

19

Legacy neuropsychiatric benefit after semaglutide is linked to maximum achieved dose and independent of the maximum weight lost

Murugadoss, K.; Venkatakrishnan, A.; Soundararajan, V.

2026-04-23 endocrinology 10.64898/2026.04.16.26351060 medRxiv

Top 0.1%

9.9%

GLP-1 receptor agonists have reshaped obesity therapeutics, but their impact on neuropsychiatric outcomes remains poorly characterized. From 29 million patients in a large federated data platform across the USA, including 489,785 semaglutide-treated patients, we conducted an observational study integrating longitudinal neuropsychiatric outcomes. From this population, we assembled a cohort of 63,215 patients with baseline neuropsychiatric conditions before treatment initiation and evaluated 24 incident neuropsychiatric outcomes. In propensity-matched comparator analyses, during the 2-year period after treatment initiation, semaglutide was associated with broadly lower neuropsychiatric event risk than metformin, SGLT2 inhibitors, and DPP-4 inhibitors. Within the semaglutide-treated cohort, higher attained dose during the first two years after the first prescription ("pre-landmark period") was associated with significantly lower incidence during the following two years ("post-landmark period") of diagnostic codes associated with substance-related disorders (P<0.001), mood disorders (P<0.001), anxiety- and stress-related disorders (P<0.001), CNS atrophies (P<0.001), neuromuscular disorders (P=0.013), eating/sleep/behavioral disorders (P=0.022), and personality/impulse-control disorders (P=0.028). Consistent with previous clinical trials, the post-landmark incidence of dementia or CNS degenerative diseases was similar between the high-dose and low-dose semaglutide cohorts (P=0.15). For most neuropsychiatric diagnoses, post-landmark incidence was strongly associated with the maximum attained semaglutide dose during the pre-landmark period, but incident cognitive symptoms and speech/language symptoms were more closely linked to the pre-landmark weight-loss magnitude (P<0.001 and P<0.003, respectively).
Bulk and single-cell transcriptomic analyses demonstrated GLP1R expression in CNS tissues (hypothalamus, caudate, putamen, nucleus accumbens, cerebellum) and peripheral nerves. Age-associated heterogeneity in GLP1R expression was evident in several of these compartments including the caudate nucleus, suggesting dynamic changes in the availability of the neurobiological substrate for semaglutide response. Together, these data support a model in which semaglutide confers a sustained, dose-dependent, weight loss-independent benefit across multiple neuropsychiatric conditions via direct CNS target engagement. This observational study motivates prospective clinical studies and mechanistic analyses to clarify the impact of GLP-1 receptor agonists on human neuropsychiatric pathways and disease processes.

20

Causal Machine Learning for Comparative Effectiveness of GLP-1 RA versus SGLT2i in Heart Failure Using Real-World EHR Data

Han, G. Y.; Kalogeropoulos, A. P.; Butzin-Dozier, Z.; Wong, R.; Wang, F.

2026-04-07 cardiovascular medicine 10.64898/2026.04.06.26350259 medRxiv

Top 0.1%

9.9%

Clinicians lack precision medicine tools to estimate individualized treatment effects for patients with heart failure (HF). Causal machine learning leveraging electronic health records can estimate both average and individualized treatment effects, enabling estimation of treatment heterogeneity. Using Stony Brook University Hospital data, we compared the effectiveness of glucagon-like peptide-1 receptor agonists (GLP-1 RA) versus sodium-glucose cotransporter 2 inhibitors (SGLT2i) in patients with HF. Under a doubly robust framework, we found a stable population-average effect: GLP-1 RA was associated with a lower risk than SGLT2i for a 1-year composite outcome of all-cause mortality or HF-related hospitalization. Heterogeneity analyses provided limited evidence for individualized treatment selection, although subgroup tests identified loop diuretic use, body mass index, and estimated glomerular filtration rate as potential effect modifiers. While these models hold promise for translating observational data into actionable precision care, careful assessment of causal assumptions and rigorous validation are essential before clinical implementation.
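One common instance of a doubly robust framework is the augmented inverse-probability-weighted (AIPW) estimator, sketched below on toy data. The abstract does not name the estimator the authors used, so AIPW is an assumption here, and the function name, inputs, and numbers are hypothetical. In practice the propensity scores and outcome predictions would come from fitted models rather than constants.

```python
def aipw_ate(y, t, propensity, mu1, mu0):
    """Augmented inverse-probability-weighted estimate of the average
    treatment effect (mean outcome under treatment minus under control).

    y          : observed outcomes (e.g. 1 = event within 1 year, 0 = none)
    t          : treatment indicator (1 = treated, 0 = comparator)
    propensity : estimated P(t = 1 | covariates) for each patient
    mu1, mu0   : outcome-model predictions under treatment and comparator

    The estimate is consistent if either the propensity model or the
    outcome model is correctly specified, hence "doubly robust".
    """
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, propensity, mu1, mu0):
        total += (m1 - m0)                            # outcome-model contrast
        total += ti * (yi - m1) / ei                  # correction, treated
        total -= (1 - ti) * (yi - m0) / (1.0 - ei)    # correction, comparator
    return total / len(y)


# Toy data: four hypothetical patients with uninformative nuisance models.
y = [1, 1, 0, 0]
t = [1, 1, 0, 0]
propensity = [0.5, 0.5, 0.5, 0.5]
mu1 = [0.5, 0.5, 0.5, 0.5]
mu0 = [0.5, 0.5, 0.5, 0.5]

print(aipw_ate(y, t, propensity, mu1, mu0))
```

With constant outcome models and equal propensities, the estimator reduces to the difference in observed group means. When the outcome is an adverse event, a negative estimate indicates lower risk in the treated group.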